@alexeykudinkin
Contributor

@alexeykudinkin alexeykudinkin commented Jul 1, 2022

What is the purpose of the pull request

This PR optimizes the file-listing sequence of the Metadata Table to make sure it's on par with or better than FS-based file listing.

Change log:

  • Cleaned up avoidable instantiations of Hadoop's Path
  • Replaced new Path w/ createUnsafePath where possible
  • Cached TimestampFormatter, DateFormatter for timezone
  • Avoided loading defaults in Hadoop conf when initializing the HFile reader
  • Avoided instantiating BaseTableMetadata twice w/in BaseHoodieTableFileIndex
  • Avoided looking up FileSystem for every partition when listing a partitioned table; it is now looked up just once
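
The formatter-caching item can be sketched roughly as follows. This is a minimal illustration, not the PR's code: it swaps in `java.time.format.DateTimeFormatter` for Spark's TimestampFormatter and DateFormatter, and the class name `ZonedFormatterCache` and the pattern are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper (not part of Hudi or Spark): keeps one formatter per time zone.
final class ZonedFormatterCache {
  private static final ConcurrentMap<ZoneId, DateTimeFormatter> CACHE = new ConcurrentHashMap<>();

  private ZonedFormatterCache() {}

  // Returns the cached formatter for the zone, building it only on first use.
  static DateTimeFormatter forZone(ZoneId zone) {
    return CACHE.computeIfAbsent(zone,
        z -> DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(z));
  }

  // Formats an epoch-millis timestamp using the cached formatter for the zone.
  static String format(long epochMillis, ZoneId zone) {
    return forZone(zone).format(Instant.ofEpochMilli(epochMillis));
  }

  public static void main(String[] args) {
    ZoneId utc = ZoneId.of("UTC");
    System.out.println(format(0L, utc));
    // The same instance is returned on repeated lookups for the same zone.
    System.out.println(forZone(utc) == forZone(utc));
  }
}
```

The point of the cache is that parsing many partition values for the same time zone reuses one formatter instead of constructing a new one per value.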

Brief change log

See above

Verify this pull request

This pull request is already covered by existing tests, such as (please describe tests).

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codope codope added priority:blocker Production down; release blocker metadata area:performance Performance optimizations labels Jul 12, 2022
@nsivabalan
Contributor

Hey @alexeykudinkin: can you link the right JIRA for the patch?

@alexeykudinkin alexeykudinkin changed the title [WIP] Optimizing file-listing sequence of Metadata Table [HUDI-4465] Optimizing file-listing sequence of Metadata Table Jul 25, 2022
@codope codope added priority:critical Production degraded; pipelines stalled and removed priority:blocker Production down; release blocker labels Jul 25, 2022
@xushiyan xushiyan added priority:blocker Production down; release blocker and removed priority:critical Production degraded; pipelines stalled labels Jul 27, 2022
.withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.BUCKET).withBucketNum("1").build())
.build();

Properties props = getPropertiesForKeyGen(true);
Member

pass HoodieTableConfig.POPULATE_META_FIELDS.defaultValue() instead of hard-coding true?

// Once meta fields are disabled, it cant be re-enabled for a given table.
if (!getTableConfig().populateMetaFields()
- && Boolean.parseBoolean((String) properties.getOrDefault(HoodieTableConfig.POPULATE_META_FIELDS.key(), HoodieTableConfig.POPULATE_META_FIELDS.defaultValue()))) {
+ && Boolean.parseBoolean((String) properties.getOrDefault(HoodieTableConfig.POPULATE_META_FIELDS.key(), HoodieTableConfig.POPULATE_META_FIELDS.defaultValue().toString()))) {
Member

Is it necessary? It's already being cast to String.

Contributor Author

It's not
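
For readers following the cast question, here is a small sketch of how `getOrDefault` on `java.util.Properties` behaves with a `(String)` cast. It makes no claim about the declared type of `POPULATE_META_FIELDS` in Hudi; the key and class name are hypothetical. With a String default the extra `toString()` is a no-op, while a non-String default would make the bare cast fail when the key is absent.

```java
import java.util.Properties;

// Hypothetical demo class; only java.util.Properties behaviour is shown here.
final class GetOrDefaultCastDemo {
  // Hypothetical key, used only for this illustration.
  static final String KEY = "example.populate.meta.fields";

  // Casts the looked-up value directly; safe only if the default is already a String.
  static boolean parseWithCast(Properties props, Object defaultValue) {
    return Boolean.parseBoolean((String) props.getOrDefault(KEY, defaultValue));
  }

  // Converts the default to a String first, so the cast cannot fail on the default.
  static boolean parseWithToString(Properties props, Object defaultValue) {
    return Boolean.parseBoolean((String) props.getOrDefault(KEY, defaultValue.toString()));
  }

  public static void main(String[] args) {
    Properties empty = new Properties();
    System.out.println(parseWithCast(empty, "true"));
    System.out.println(parseWithToString(empty, Boolean.TRUE));
    try {
      parseWithCast(empty, Boolean.TRUE);
    } catch (ClassCastException e) {
      System.out.println("cast failed for a Boolean default");
    }
  }
}
```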

}

/**
* TODO elaborate
Member

todo? javadoc only right?

FileSystem fs = FSUtils.getFs(pathForReader.toString(), new Configuration());
// Read the content
- HoodieHFileReader<IndexedRecord> reader = new HoodieHFileReader<>(fs, pathForReader, content, Option.of(writerSchema));
+ HoodieHFileReader<IndexedRecord> reader = new HoodieHFileReader<>(null, pathForReader, content, Option.of(writerSchema));
Member

This could affect HFile reading. I believe there is some validation in HFile system or HFile's reader context for fs to be non-null. I think we should still pass fs and still keep this line in HoodieHFileUtils#createHFileReader:

Configuration conf = new Configuration(false);

Contributor Author

Yeah, I checked it and it actually doesn't use fs at all.

private List<String> getAllPartitionPathsUnchecked() {
try {
if (partitionColumns.length == 0) {
return Collections.singletonList("");
Member

Should it be Collections.emptyList()?

Contributor Author

Non-partitioned table has exactly one partition, which we designate w/ ""

public Map<String, FileStatus[]> getAllFilesInPartitions(List<String> partitions)
throws IOException {
if (partitions.isEmpty()) {
return Collections.emptyMap();
Member

If BaseHoodieTableFileIndex#getAllPartitionPathsUnchecked returns Collections.singletonList(""), should we add an entry for "" in the map, or rather make getAllPartitionPathsUnchecked return Collections.emptyList()?

Contributor Author

Agree, this is somewhat dissonant, but that's just the way things are -- for non-partitioned tables it's assumed that the only partition that is there has to be identified by ""
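
The convention described in this thread can be sketched as follows. This is a simplified illustration under stated assumptions, not the Hudi implementation: `PartitionListingSketch`, `partitionPaths`, and `absolutePaths` are hypothetical names, and plain strings stand in for Hadoop paths. It shows why the empty-input early return is not hit for a non-partitioned table: the partition list is never empty, it contains the single entry "".

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "" partition convention; plain strings stand in for paths.
final class PartitionListingSketch {
  // No partition columns means exactly one partition, designated by "".
  static List<String> partitionPaths(String[] partitionColumns, List<String> discovered) {
    if (partitionColumns.length == 0) {
      return Collections.singletonList("");
    }
    return discovered;
  }

  // Resolves each relative partition path against the base path; "" maps to the base path itself.
  static Map<String, String> absolutePaths(String basePath, List<String> partitions) {
    if (partitions.isEmpty()) {
      return Collections.emptyMap();
    }
    Map<String, String> result = new LinkedHashMap<>();
    for (String partition : partitions) {
      result.put(partition, partition.isEmpty() ? basePath : basePath + "/" + partition);
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> partitions = partitionPaths(new String[0], Collections.emptyList());
    System.out.println(absolutePaths("/tmp/tbl", partitions));
  }
}
```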


String keyGen = properties.getProperty("hoodie.datasource.write.keygenerator.class");
if (!Objects.equals(keyGen, "org.apache.hudi.keygen.NonpartitionedKeyGenerator")) {
builder.setPartitionFields("some_nonexistent_field");
Member

extract to constant to standardize across tests?

Contributor Author

I don't think we should actually standardize on this one; it's just there to stop the bleeding in misconfigured tests.


private def shouldValidatePartitionColumns(spark: SparkSession): Boolean = {
// NOTE: We can't use helper, method nor the config-entry to stay compatible w/ Spark 2.4
spark.sessionState.conf.getConfString("spark.sql.sources.validatePartitionColumns", "true").toBoolean
Member

Should this go into Spark2Adapter or Spark2ParsePartitionUtil?

Contributor Author

We don't need to

@alexeykudinkin alexeykudinkin force-pushed the ak/mtd-fl-lst-opt branch 3 times, most recently from b038d97 to 705660e on July 28, 2022 23:11
@codope codope added priority:critical Production degraded; pipelines stalled and removed priority:blocker Production down; release blocker labels Aug 5, 2022
Member

@codope codope left a comment

LGTM. Can you please rebase? We can land once the CI is green.

// Make sure key-generator is configured properly
ValidationUtils.checkArgument(recordKeyField == null || !recordKeyField.isEmpty(),
"Record key field has to be non-empty!");
ValidationUtils.checkArgument(partitionPathField == null || !partitionPathField.isEmpty(),
Member

Should the validation message be more user-friendly? Let's say
"Partition path field has to be non-empty! For a non-partitioned table, set the key generator class to NonpartitionedKeyGenerator".
Also, why are these validations only added for SimpleKeyGenerator? Why not other keygens as well?

Contributor Author

I don't think it makes sense to put suggestions into exception messages -- an exception message should focus on the problem that triggered it rather than on how to remedy it. An empty partition-path field is usually a sign of misconfiguration: there's no default value, so it means the user passed "" explicitly.

Member

Fair enough.
Should we add these validations to other keygens as well?

Member

Discussed offline. It will be taken up separately. @alexeykudinkin, if you have a JIRA, please link it here. For simple keygen we need the validation because some misconfigured tests were passing "" as partition fields.
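
To make the semantics of the validation in this thread concrete, here is a minimal sketch. The `checkArgument` helper below is a local stand-in for ValidationUtils.checkArgument and is assumed to throw IllegalArgumentException. The second message is also an assumption, inferred from the reviewer's suggestion, since the snippet above is cut off before it. A null field passes (not configured), while an explicit "" is rejected.

```java
// Hypothetical sketch; not the SimpleKeyGenerator source.
final class KeyGenValidationSketch {
  // Local stand-in for ValidationUtils.checkArgument (assumed to throw IllegalArgumentException).
  static void checkArgument(boolean condition, String message) {
    if (!condition) {
      throw new IllegalArgumentException(message);
    }
  }

  // Accepts null (field not configured) but rejects an explicitly empty field name.
  static void validate(String recordKeyField, String partitionPathField) {
    checkArgument(recordKeyField == null || !recordKeyField.isEmpty(),
        "Record key field has to be non-empty!");
    // Assumed message: the original snippet is truncated before this argument.
    checkArgument(partitionPathField == null || !partitionPathField.isEmpty(),
        "Partition path field has to be non-empty!");
  }

  public static void main(String[] args) {
    validate("id", "dt");   // passes
    validate(null, null);   // passes: nothing configured
    try {
      validate("id", "");   // rejected: explicit empty partition path field
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```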

@hudi-bot
Collaborator

hudi-bot commented Sep 7, 2022

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@codope codope merged commit 4af60dc into apache:master Sep 9, 2022
Comment on lines -182 to +189
- List<String> matchedPartitionPaths = FSUtils.getAllPartitionPaths(engineContext, metadataConfig, basePath)
+ List<String> matchedPartitionPaths = getAllPartitionPathsUnchecked()
Member

@xushiyan xushiyan Oct 26, 2022

This change affects partitioned tables that meet both of these conditions: 1) hoodie.table.partition.fields is not present in the table config, and 2) metadata is disabled. getAllPartitionPathsUnchecked() treats such tables as non-partitioned, which results in no records being loaded.

Contributor Author

The core of the problem is that hoodie.table.partition.fields has to be properly configured -- without it, the table would be considered non-partitioned by some other parts of the code as well (outside of this one), so we need to make sure it is set properly.
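
The failure mode discussed here can be illustrated with a local-filesystem analogy. This is only a sketch: it uses `java.nio.file` rather than Hudi or Hadoop APIs, and the class name, directory layout, and file name are hypothetical. A table whose files live under partition directories yields nothing when only the "" partition (the base path) is listed without recursion.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical local-filesystem analogy; not Hudi's listing code.
final class MissingPartitionFieldsSketch {
  // Lists regular files directly under basePath/partition, without recursing.
  static List<String> listFiles(Path basePath, String partition) {
    Path dir = partition.isEmpty() ? basePath : basePath.resolve(partition);
    List<String> names = new ArrayList<>();
    try (Stream<Path> children = Files.list(dir)) {
      children.filter(Files::isRegularFile)
          .forEach(p -> names.add(p.getFileName().toString()));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return names;
  }

  // Builds a throwaway layout with one data file under a date partition.
  static Path createPartitionedLayout() {
    try {
      Path base = Files.createTempDirectory("tbl");
      Path partition = Files.createDirectories(base.resolve("2022/07/28"));
      Files.createFile(partition.resolve("data.parquet"));
      return base;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Path base = createPartitionedLayout();
    // Treating the table as non-partitioned lists only the base path and finds no data files.
    System.out.println(listFiles(base, ""));
    // Listing the actual partition finds the file.
    System.out.println(listFiles(base, "2022/07/28"));
  }
}
```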

Labels

area:performance Performance optimizations priority:critical Production degraded; pipelines stalled

Projects

Status: Done

6 participants